Date’s problem

Author

Santiago Sotelo

Published

February 21, 2023

Problem

  • The date-time variables appear in different formats in the data downloaded from BecharaReport API.
  • Some values of date-time variables does not reflect the date-time of the user considering their timezone.

Objectives

  1. Describe the problem.

  2. Explain the origin of the problem.

For the purpose of this report, we are going to use the data from BecharaReport API gathered in 2023-01-25. This data has 799362 observations and 80 variables, and can be found in the following Dropbox path:

Mobio Interactive Dropbox/Research/RWD/Shared R Scripts/Input csv/BecharaReportFull-2023-01-25T17_00_16.csv.

Obj 1: Describing the problem

In the AmDTx data downloaded from BecharaReport API we have the following 10 date-time variables:

Definitions from BecharaReport+
Official label Definition
created Timestamp of account creation
session_start Unique timestamp of AmDTx session start
session_finish Unique timestamp of AmDTx session finish
snapshot_start_pre Unique timestamp of AmDTx snapshot start (pre session or standalone)
snapshot_finish_pre Unique timestamp of AmDTx snapshot finish (pre session or standalone)
snapshot_start_post Unique timestamp of AmDTx snapshot start (post session or standalone)
snapshot_finish_post Unique timestamp of AmDTx snapshot finish (post session or standalone)
intent_start Timestamp of user initiating Intent portion of My Moment feature
intent_finish Timestamp of user completing Intent portion of My Moment feature
redeem_date When a coupon was applied to a user’s account
badges List of badges that the user has been awared up to the snapshot/session initiation

This variables appear in the database in the following way:

Selection of time & datetime variables considering user ID and session ID

Date-time formats

There are different time formats for date-time information:

Date-time formats from UTC Time Now
Date-Time Format UTC Date Time Now
UTC 2023-01-25T13:47:24Z
ISO-8601 2023-01-25T13:47:24+0000
RFC 2822 Wed, 25 Jan 2023 13:47:24 +0000
RFC 850 Wednesday, 25-Jan-23 13:47:24 UTC
RFC 1036 Wed, 25 Jan 23 13:47:24 +0000
RFC 1123 Wed, 25 Jan 2023 13:47:24 +0000
RFC 822 Wed, 25 Jan 23 13:47:24 +0000
RFC 3339 2023-01-25T13:47:24+00:00
ATOM 2023-01-25T13:47:24+00:00
COOKIE Wednesday, 25-Jan-2023 13:47:24 UTC
RSS Wed, 25 Jan 2023 13:47:24 +0000
W3C 2023-01-25T13:47:24+00:00
Unix Epoch 1674654444
YYYY-DD-MM HH:MM:SS 2023-25-01 13:47:24
YYYY-DD-MM HH:MM:SS am/pm 2023-25-01 01:47:24 PM
DD-MM-YYYY HH:MM:SS 25-01-2023 13:47:24
MM-DD-YYYY HH:MM:SS 01-25-2023 13:47:24

In BecharaReport data we can see date-time variables formatted in UTC format and W3C format.

In W3C format, the +00:00 at the end of the string means how many hh:mm the time is offset from UTC (Universal Time Coordinated). To know how many hours offset a place is from UTC, we can follow this map time, although in practice there are some exceptions:

UTC offset world map

Let’s enumerate the following incongruencies in the date-time variables:

Format problem

  • W3C format: created, session_start, session_finish, snapshot_start_pre, snapshot_finish_pre, snapshot_start_post, snapshot_finish_post, redeem_date, badge
  • UTC format: intent_start, intent_finish
    • This variables are already defaulted UTC +00:00.

Atypic values instead of NA

  • Intent start & finish are variables present in “My Moment” sessions. In case of other types of sessions the backend is defaulting the missing value as 0001-01-01T00:00Z.

Number of distinct observations equal to 0001-01-01T00:00Z by different types in BecharaReport API data:

intent_start
type intent_start n % in type
<snapshot> 0001-01-01T00:00Z 1371 6.17
ac2048 0001-01-01T00:00Z 137 100.00
JourneyActivity 0001-01-01T00:00Z 229 51.12
JourneyLesson 0001-01-01T00:00Z 20482 86.74
JourneyTimer 0001-01-01T00:00Z 433 86.43
Library 0001-01-01T00:00Z 7880 80.38
None 0001-01-01T00:00Z 4 66.67
Timer 0001-01-01T00:00Z 3846 80.88
intent_finish
type intent_finish n % in type
<snapshot> 0001-01-01T00:00Z 1371 6.17
ac2048 0001-01-01T00:00Z 137 100.00
JourneyActivity 0001-01-01T00:00Z 229 51.12
JourneyLesson 0001-01-01T00:00Z 20482 86.74
JourneyTimer 0001-01-01T00:00Z 433 86.43
Library 0001-01-01T00:00Z 7880 80.38
None 0001-01-01T00:00Z 4 66.67
Timer 0001-01-01T00:00Z 3846 80.88

Time-zone problem

  • The date-time variables define the UTC offset +00:00 according to the location of the user. When some date-time variables are not able to retrieve the GPS location of the user, then it defaults UTC to +00:00.

  • This is a problem because if we are interested in the local time of the user (e.g if the user used the AmDTx app in the morning or at night) then defaulting to +00:00 would change that local time.

  • For example, if the user is in Lima, Peru (UTC -05:00 ) and uses the app at 8:49pm, it would be defaulted to 1:49am.

    Date-time conversion
    Peru time (EST) / UTC -05:00 UTC +00:00
    2023-01-10T20:49:33-05:00 2023-01-11 01:49:33 UTC

All W3C format date-time variables are defaulted to UTC +00:00 with the exception of badge variable.

created n
+00:00 92386
session_start n
+00:00 46097
NA 46289
session_finish n
+00:00 46097
NA 46289
snapshot_start_pre n
+00:00 43971
NA 48415
snapshot_finish_pre n
+00:00 43971
NA 48415
snapshot_start_post n
+00:00 19242
NA 73144
snapshot_finish_post n
+00:00 19242
NA 73144
redeem_date n
+00:00 44184
NA 48202

Number of distinct observations defaulted to UTC +00:00:

Badge situation

As its definition says, badge represents a list of badges the user gains by using the app. This variable records the user badges in the following format:

  • {badge1,badge2,badge3}
  • For example: {B001:2019-07-18T12:15:40-07:00,B004:2019-07-25T07:00:54-07:00,B003:2019-06-18T12:15:40-07:00}
  • Not necessarily in order.
Displaying distinct observations in badge variable

As we can see, badge variable is recording the UTC offset of the badge gain and this can provide a future solution for the date-time problem if we apply the same principle of date-time recording to the other date-time variables.

Obj 2: Origin of the problem

As in the duplicate’s problem, part of the problem could be in the extraction of the data by BecharaReport API. Although, for example the atypic values problem also appear in the SQLPad database.

Further explanation about the problem could be found in the way SQL’s timestamp & timestampz are being used to store the date-time in AmDTx database. More documentation about this variables could be found in Cockroach Labs, MySQL

Information from SQL Pad
Official label Variable type in database
created timestampz
session_start timestampz
session_finish timestampz
snapshot_start_pre timestamp
snapshot_finish_pre timestamp
snapshot_start_post timestamp
snapshot_finish_post timestamp
intent_start timestamp
intent_finish timestamp
redeem_date timestamp
badges varchar

There is a variable called last_access_utc in user_token table which does not appear in BecharaReport+, further gathering in this variable is recommended.